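Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while also incurring substantially less training cost than dense variants. The largest GLaM has 1.2 trillion parameters, approximately 7x more than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.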
translated by Google Translate
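The sparsely activated mixture-of-experts idea in the GLaM abstract can be sketched as follows. This is a minimal illustration, not the GLaM implementation: the linear gate, the top-2 routing (`k=2`), the toy scaling "experts", and all names (`moe_layer`, `make_expert`, `gate_weights`) are assumptions made for the example. The point it shows is that only `k` of the experts are evaluated per input, so parameter count can grow without a matching growth in compute.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, gate_weights, experts, k=2):
    """Route input vector x to the top-k experts and mix their outputs.

    gate_weights: one weight vector per expert (hypothetical linear gate).
    experts: list of callables mapping a vector to a vector.
    """
    # Gate logits: one dot product per expert.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(logits)
    # Keep only the k most probable experts; the others are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + (probs[i] / norm) * y_j for o, y_j in zip(out, y)]
    return out, top

def make_expert(scale):
    # Toy expert that scales its input; stands in for a feed-forward network.
    return lambda v: [scale * v_j for v_j in v]

random.seed(0)
num_experts, dim = 8, 4
experts = [make_expert(s) for s in range(1, num_experts + 1)]
gate_weights = [[random.uniform(-1, 1) for _ in range(dim)]
                for _ in range(num_experts)]
x = [0.5, -1.0, 2.0, 0.1]
y, active = moe_layer(x, gate_weights, experts, k=2)
print("active experts:", active, "of", num_experts)
```

Here 8 experts hold parameters, but each input pays for only 2 expert evaluations plus the gate.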
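Deep Spiking Neural Networks (SNNs) are difficult to optimize with gradient-based methods because of their discrete binary activations and complex spatial-temporal dynamics. Considering the huge success of ResNet in deep learning, it would be natural to train deep SNNs with residual learning. The previous Spiking ResNet mimics the standard residual block in ANNs and simply replaces the ReLU activation layers with spiking neurons, which suffers from the degradation problem and can hardly implement residual learning. In this paper, we propose the spike-element-wise (SEW) ResNet to realize residual learning in deep SNNs. We prove that SEW ResNet can easily implement identity mapping and overcome the vanishing/exploding gradient problems of Spiking ResNet. We evaluate SEW ResNet on the ImageNet, DVS Gesture, and CIFAR10-DVS datasets, and show that it outperforms state-of-the-art directly trained SNNs in both accuracy and number of time-steps. Moreover, SEW ResNet achieves higher performance by simply adding more layers, providing a simple method to train deep SNNs. To the best of our knowledge, this is the first time that directly training deep SNNs with more than 100 layers has become possible. Our code is available at https://github.com/fangwei123456/spike-element-wise-resnet.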
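The identity-mapping claim for SEW ResNet can be illustrated with a small sketch. It rests on several assumptions not stated in the abstract: the neuron is reduced to a stateless threshold function, the element-wise function is taken to be addition, and the names `spike`, `spiking_resnet_block`, `sew_add_block`, and `zero_branch` are hypothetical. With these simplifications, applying the shortcut after the spiking neuron passes the input through unchanged whenever the residual branch is silent, whereas applying it before the neuron makes the result depend on the neuron's threshold.

```python
def spike(v, threshold):
    # Stateless stand-in for a spiking neuron: emit 1 where input >= threshold.
    return [1 if v_i >= threshold else 0 for v_i in v]

def spiking_resnet_block(s, branch, threshold):
    # Shortcut is added BEFORE the spiking neuron (Spiking ResNet style).
    return spike([a + b for a, b in zip(branch(s), s)], threshold)

def sew_add_block(s, branch, threshold):
    # Shortcut is added AFTER the spiking neuron (element-wise ADD).
    return [a + b for a, b in zip(spike(branch(s), threshold), s)]

def zero_branch(s):
    # A residual branch that has learned to output nothing.
    return [0.0] * len(s)

s_in = [1, 0, 1, 1, 0]
sew_out = sew_add_block(s_in, zero_branch, threshold=2.0)
plain_out = spiking_resnet_block(s_in, zero_branch, threshold=2.0)
print("SEW block:           ", sew_out)    # equals s_in
print("Spiking ResNet block:", plain_out)  # input spikes are lost
```

Note that element-wise addition can produce values greater than 1 when both the branch and the shortcut fire, so the block output is not strictly binary in this sketch.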
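Surgical action triplet recognition provides a better understanding of the surgical scene. This task is highly relevant because it provides the surgeon with context-aware support and safety. The current go-to strategy for improving performance is the development of new network mechanisms. However, the performance of current state-of-the-art techniques is substantially lower than on other surgical tasks. Why is this happening? This is the question we address in this work. We present the first study to understand the failure of existing deep learning models through the lens of robustness and explainability. First, we study existing models using an adversarial optimization scheme. We then analyze the failure modes via feature-based explanations. Our study reveals that the key to improving performance and increasing reliability lies in the core and spurious attributes. Our work opens the door to more trustworthy and reliable deep learning models in surgical science.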
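Frequency-domain simulation of seismic waves plays an important role in seismic inversion, but it remains challenging for large models. The recently proposed physics-informed neural network (PINN), an effective deep learning method, has been applied successfully to a wide range of partial differential equations (PDEs), yet there is still room for improvement. For example, PINN can produce inaccurate solutions when the PDE coefficients are non-smooth and describe structurally complex media. In this paper, we use PINN to solve the acoustic and visco-acoustic scattered-field wave equation in the frequency domain, rather than the wave equation itself, in order to remove the source singularity. We first illustrate that non-smooth velocity models lead to inaccurate wavefields when no boundary conditions are implemented in the loss function. We then add perfectly matched layer (PML) conditions to the PINN loss function and design a quadratic neural network to overcome the detrimental effects of non-smooth models in PINN. We show that PML and quadratic neurons improve the results as well as the attenuation, and we discuss the reasons for this improvement. We also illustrate that a network trained during one wavefield simulation can be used to pre-train the neural network for another wavefield simulation after the PDE coefficients are altered, improving the convergence speed accordingly. This pre-training strategy should find application in iterative full waveform inversion (FWI) and time-lag target-oriented imaging, when the model perturbation between two consecutive iterations or two consecutive experiments is small.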
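Federated learning is an emerging concept in the field of distributed machine learning. It has enabled GANs to benefit from rich distributed training data while preserving privacy. However, in a non-IID setting, current federated GAN architectures are unstable, struggle to learn distinct features, and are prone to mode collapse. In this paper, we propose a novel architecture, Multi-FLGAN, to address the problems of low-quality images, mode collapse, and instability on non-IID datasets. Our results show that Multi-FLGAN is four times as stable and performant as the baseline FLGAN, on average over 20 clients.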
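Multi-chip modules (MCMs) reduce the design and fabrication cost of machine learning (ML) accelerators while delivering performance and energy efficiency on par with a monolithic large chip. However, ML compilers targeting MCMs need to solve complex optimization problems optimally and efficiently to achieve this high performance. One such problem is the multi-chip partitioning problem, in which the compiler determines the optimal partitioning and placement of the operations in a tensor computation graph onto the chiplets of an MCM. Partitioning ML graphs for MCMs is particularly hard because the search space grows exponentially with the number of available chiplets and the number of nodes in the neural network. Furthermore, the constraints imposed by the underlying hardware produce a search space in which valid solutions are extremely sparse. In this paper, we present a strategy that uses a deep reinforcement learning (RL) framework to emit a possibly invalid candidate partition, which is then corrected by a constraint solver. Using the constraint solver ensures that RL encounters valid solutions in the sparse space often enough to converge with fewer samples than non-learned strategies. The architectural choices we made for the policy network allow us to generalize across different ML graphs. Our evaluation of a production-scale model, BERT, on real hardware shows that the partitioning generated by the RL policy achieves 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively. In addition, fine-tuning the pre-trained RL policy reduces the search time from 3 hours to only 9 minutes, while achieving the same throughput as training the RL policy from scratch.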
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
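The text-to-text framing described above, where every task becomes "text in, text out", can be sketched in a few lines. This is an illustration only: the function name `to_text_to_text`, the task keys, and the exact prefix strings are assumptions chosen for the example rather than a specification of the released code.

```python
def to_text_to_text(task, fields):
    """Render a task instance as a single input string for a text-to-text model.

    The templates below are illustrative task prefixes, not an official list.
    """
    templates = {
        "translation": "translate English to German: {source}",
        "summarization": "summarize: {document}",
        "acceptability": "cola sentence: {sentence}",
    }
    return templates[task].format(**fields)

# Each example is an (input text, target text) pair, whatever the task.
examples = [
    (to_text_to_text("translation", {"source": "That is good."}),
     "Das ist gut."),
    (to_text_to_text("acceptability", {"sentence": "The course is jumping well."}),
     "not acceptable"),
]
for source_text, target_text in examples:
    print(repr(source_text), "->", repr(target_text))
```

Because classification labels are also emitted as text, a single model, loss, and decoding procedure can serve every task.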
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which hinders graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built from the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the environment returns, the conventional evaluation outcome, as the coefficients to build an importance map. We also conduct experiments to investigate three major questions concerning the equality of frame importance, the effectiveness of the importance map, and the connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames in the demonstrations.
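The mask-retrain-weight loop described for R2RISE can be sketched as below. This is a simplified illustration under stated assumptions: the expensive "retrain the IL model and evaluate it" step is replaced by a hypothetical toy function `toy_train_and_evaluate`, frames are kept independently with probability `keep_prob`, and each frame's score is the average return over the masks that kept it. None of these choices is taken from the paper.

```python
import random

def importance_map(num_frames, train_and_evaluate,
                   num_masks=500, keep_prob=0.5, seed=0):
    """Score each demonstration frame by the returns of runs that kept it."""
    rng = random.Random(seed)
    weighted = [0.0] * num_frames
    counts = [0] * num_frames
    for _ in range(num_masks):
        # Randomly keep (1) or drop (0) each frame of the demonstration.
        mask = [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]
        # Stand-in for: retrain the black-box IL model on the masked
        # demonstration, roll it out, and read the environment return.
        ret = train_and_evaluate(mask)
        for t, kept in enumerate(mask):
            if kept:
                weighted[t] += ret
                counts[t] += 1
    # Average return over the masks in which each frame was present.
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]

def toy_train_and_evaluate(mask):
    # Hypothetical environment: only frames 2 and 5 matter for the return.
    return 1.0 * mask[2] + 1.0 * mask[5]

scores = importance_map(8, toy_train_and_evaluate)
top_two = sorted(range(8), key=lambda t: scores[t], reverse=True)[:2]
print("most important frames:", sorted(top_two))
```

In the toy setting, the two frames that actually drive the return receive the highest scores, which is the behavior an importance map of this kind is meant to expose.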